QTM 447 Lecture 8: Intro to Neural Networks

Kevin McAlister

February 6, 2025

Machine Learning

\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]

Recall that the goal of supervised machine learning is to learn some function that generally does a good job of mapping the input features to an outcome.

Generically, we assume that:

\[y = f(\mathbf x) + \epsilon\]

  • There exists a true function out there

Machine Learning

We would like to find an approximation to this function, \(\hat{f}(\mathbf x)\), that minimizes the generalization error:

\[\hat{f}(\mathbf x) = \underset{f^*(\mathbf x)}{\text{argmin }} E_\mathcal T[\mathcal L(y - f^*(\mathbf x))]\]

Unfortunately, we don’t get to see this quantity and need to approximate it.

Machine Learning

Most commonly, we rely on the empirical risk minimization framework. Given training data:

\[E_\mathcal T[\mathcal L(y - f^*(\mathbf x))] \approx \expn \mathcal L(y_i - f^*(\mathbf x_i)) + \frac{\lambda}{N} C(f^*(\mathbf x))\]

where \(C()\) is a measure of the complexity of the approximating function.

  • Per the VC dimension proof, we know that the gap between the training error and the generalization error goes to 0 as \(N \to \infty\)

Universal Approximators

Instead, let’s restrict ourselves to learning methods that are universal approximators

Universal Approximator:

A learning method that can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error.

Deciphering the nerd speak:

A universal approximator can uncover any mapping of \(\mathbf x\) to \(y\) that is continuous on a bounded subset of \(\mathbb R^P\)

Even fewer syllables:

It can learn any reasonable function

Universal Approximators

IMPORTANT!

Just because it can, doesn’t mean that it will.

Reason 1:

The optimization method used to try to learn the function can’t actually find the function

  • Non-convex loss spaces and local minima!

A practical reason that a universal approximator won’t find the truth.

Universal Approximators

IMPORTANT!

Just because it can, doesn’t mean that it will.

Reason 2:

The training algorithm may choose the wrong function due to overfitting

  • All we get is the training data, so a universal approximator may not uncover the minimum generalization error function

  • Confusing irreducible error for bias.

Universal Approximators

There’s a lot!

  • KNN

  • Explicit Polynomial Feature Transformations

  • Implicit Polynomial Expansions via Infinite Basis Kernels

  • Regression/Classification Trees

And neural networks with at least one hidden layer

  • There are a lot more methods that do this, just not used a lot for a variety of reasons

Universal Approximators

Problems:

  • Exploding parameter space

  • Poor generalizability

  • Inability to extend well to complex data types

Neural Networks

Let’s start with a simple example: the XOR problem

Neural Networks

\[\mathbf X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix} \]

\[ \mathbf y = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} \]

Goal: shatter the space with a classifier!

Starting point: linear classifier.

Does anyone remember the VC dimension of a linear classifier?
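We can check this empirically. A quick sketch using sklearn's LogisticRegression as the linear classifier (my choice for illustration; any linear decision boundary faces the same limit):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The XOR inputs and labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Fit a linear classifier and score it on the four training points
clf = LogisticRegression().fit(X, y)
acc = clf.score(X, y)
print(acc)  # at most 0.75: no single line separates XOR
```

A line can put at most three of the four points on the correct side, so training accuracy is capped at 3/4.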

Neural Networks

There is no way that a linear classifier can shatter this space!

Can we get around this with polynomials?

Neural Networks

At a 5th degree polynomial, this works!

  • But, it seems a little silly

  • All we really need to do here is draw two lines instead of one.

Neural Networks

Let’s make things simple (and non-differentiable) by stating an alternative loss function based on the Heaviside step function:

\[\mathcal L(\bb | \mathbf X, \mathbf y) = \expn I(\theta_i \ge 0)^{y_i} (1 - I(\theta_i \ge 0))^{1 - y_i}\]

\[\theta_i = \mathbf z_i^T \bb + b\]

where \(\mathbf z_i\) is the vector of features for instance \(i\).

  • This is essentially a deterministic version of logistic regression

Neural Networks

\(\theta_i\) is determined using a linear equation

  • This will lead to a single line…

Let’s consider a set of chained equations

\[h_{i1} = \mathbf w_1^T \mathbf x_i + c_1\] \[h_{i2} = \mathbf w_2^T \mathbf x_i + c_2\]

\[\theta_i = \bb^T \mathbf h_i + b\]

Neural Networks

More compactly noted, this chain can be represented as:

\[\theta_i = \bb^T(\mathbf W^T \mathbf x_i + \mathbf c) + b\]

where \(\bb\) is a 2 vector of weights, \(\mathbf W\) is a \(2 \times 2\) matrix of weights, \(\mathbf c\) is a 2 vector of bias terms (think intercepts), and \(b\) is an additional bias term.

This is a multilayer perceptron

  • We start with the input features

  • We pass the inputs to a hidden layer

  • The hidden layer is then passed to the output layer

Neural Networks

Let’s suppose that we know \(\mathbf W\) and \(\mathbf c\):

\[\mathbf W = \begin{bmatrix} 1 & 1\\ 1 & 1 \end{bmatrix} \]

\[\mathbf c = \begin{bmatrix} 0\\ -1 \end{bmatrix} \]

Then we can compute the values of \(\mathbf H\) in the hidden layer as:

\[ \mathbf H = \mathbf X \mathbf W + \mathbf c = \begin{bmatrix} 0 & -1\\ 1 & 0\\ 1 & 0\\ 2 & 1\\ \end{bmatrix} \]

Neural Networks

\(\mathbf h_i\) is the hidden representation of the original features given the coefficients for the hidden layer!

  • A feature transformation for \(\mathbf x_i\)

The goal of a multilayer perceptron is to get around a “bad” natural space where the instances cannot be separated linearly by engineering a new feature space

  • Typically through affine transformations of the original feature space!

  • Recall that this is the basis of the kernel trick; just not going from \(P \to \infty\) immediately…

The hope is that the new space admits a linear classifier!

Neural Networks

That didn’t work…

The problem is that a linear hidden layer (no matter how big it is) is just a linear model with more steps…

\[\theta_i = \bb^T(\mathbf W^T \mathbf x_i + \mathbf c) + b\]

\[\theta_i = \bb^T\mathbf W^T \mathbf x_i + (\bb^T\mathbf c + b)\]

Taking a long walk for a short drink of water.
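This collapse is easy to verify numerically; a minimal sketch with arbitrary invented weights:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # P = 3 input features
W = rng.normal(size=(3, 2))   # P x K hidden weights
c = rng.normal(size=2)        # hidden bias terms
beta = rng.normal(size=2)     # output weights
b = 0.5                       # output bias

# The "network" with a linear hidden layer...
theta_chain = beta @ (W.T @ x + c) + b

# ...is exactly one linear model with weights W @ beta
theta_flat = (W @ beta) @ x + (beta @ c + b)

print(np.allclose(theta_chain, theta_flat))  # True
```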

Neural Networks

The trick: activation functions

Use some function to create a nonlinear mapping of the feature space to the hidden space.

We’ll start with the most common one, called the rectified linear unit - ReLU for short.

Given a scalar input

\[\varphi(x) = \text{max}(0, x)\]

We’re taking \(x\) and putting a kink in it at 0!

  • Remember what other method did this? It’s sort of like putting a knot in a rope to change its direction

Neural Networks

We typically apply ReLU elementwise to a vector or matrix

\[ \mathbf H = \mathbf X \mathbf W + \mathbf c = \begin{bmatrix} 0 & -1\\ 1 & 0\\ 1 & 0\\ 2 & 1\\ \end{bmatrix} \]

\[ \mathbf H = \varphi\left(\mathbf X \mathbf W + \mathbf c\right) = \begin{bmatrix} 0 & 0\\ 1 & 0\\ 1 & 0\\ 2 & 1\\ \end{bmatrix} \]
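Both matrices above can be reproduced with a few lines of numpy, using the \(\mathbf W\) and \(\mathbf c\) chosen earlier:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
W = np.array([[1, 1], [1, 1]])
c = np.array([0, -1])

# Linear hidden layer, then the elementwise ReLU
H_linear = X @ W + c
H = np.maximum(0, H_linear)

print(H_linear)  # [[0 -1] [1 0] [1 0] [2 1]]
print(H)         # [[0 0] [1 0] [1 0] [2 1]]
```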

Neural Networks

We’ve separated the points in the hidden space!

  • This means that we can shatter this space.

To finish this specification, we set:

\[\bb^T = [1,-2]\]

and \(b = 0\).

Therefore, the decision boundary will be any point where:

\[\varphi\left(\mathbf X \mathbf W + \mathbf c\right)\bb + b = 0\]
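Putting the pieces together, the forward pass with these weights reproduces \(\mathbf y\) exactly:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

W = np.array([[1, 1], [1, 1]])
c = np.array([0, -1])
beta = np.array([1, -2])
b = 0

# Hidden layer with ReLU, then the output layer
theta = np.maximum(0, X @ W + c) @ beta + b
print(theta)  # [0 1 1 0] -- matches y
```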

Neural Networks

We got it right!

How many parameters did it take?

  • 4 weights in the hidden layers

  • 2 bias terms

  • 2 “logistic” weights

  • 1 “logistic” bias term

9 terms (for four inputs…)

The 5th degree polynomial model took 21 parameters!

Neural Networks

Why not have three hidden units in the hidden layer?

  • There’s nothing saying we can’t!

Neural Networks

A flexible specification of a single layer fully connected neural network (FCNN).

Let \(\mathcal L(\boldsymbol \theta | \mathbf x, y) = g(\boldsymbol \theta , \mathbf x, y)\) be a generic loss function with unknowns \(\boldsymbol \theta\).

Let \(\mathbf x\) be an arbitrary input vector of length \(P\).

Neural Networks

\[\theta = \boldsymbol \beta^T \phi(\mathbf x; \mathbf W, \mathbf c) + b\]

  • \(\mathbf W\) is a weight matrix with number of rows equal to \(P\) and number of columns equal to the number of hidden units, \(K\)

  • \(\mathbf c\) is a \(K\) vector of bias terms

  • \(\phi()\) is a function applied to the combination of \(\mathbf x\), \(\mathbf W\), and \(\mathbf c\) that induces nonlinearities

  • \(\boldsymbol \beta\) is a \(K\)-vector of coefficients used to compute the loss.

  • \(b\) is a bias term for the loss layer.

Neural Networks

Most commonly, the single layer FCNN is set up with a linear predictor in the hidden layer:

\[\theta = \boldsymbol \beta^T \phi(\mathbf W^T \mathbf x + \mathbf c) + b\]

  • This is the most “basic” FCNN

  • Surprisingly flexible as \(K\) gets larger!

Neural Networks

Given a choice of \(K\) and a loss function, we can find the coefficients that minimize the loss!

  • Not all that different than what we’ve seen before.

Let’s start with a two-class classification problem.

Neural Networks

A chain of four equations! \[\mathcal L(\boldsymbol \theta | \mathbf X , \mathbf y) = - \expn \left[ y_i \log \theta_i + (1 - y_i) \log ( 1 - \theta_i) \right]\]

\[\theta_i = \sigma(\mathbf z_i^T \boldsymbol \beta)\]

\[\mathbf z_i = \varphi(\mathbf h_i)\]

\[\mathbf h_i = \mathbf x_i^T \mathbf W\]

  • Getting rid of the bias terms for simplicity. In this setup, just add a 1 to the front of the layer input to get the bias term.

  • Let’s let \(\varphi()\) be the ReLU function:

\[\varphi(x) = \text{max}(0, x)\]
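The four-equation chain can be written as a small forward pass. A sketch with random invented weights and data (bias terms dropped, as in the equations above):

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def forward(X, W, beta):
    H = X @ W                 # N x K linear hidden layer
    Z = np.maximum(0, H)      # elementwise ReLU
    return sigmoid(Z @ beta)  # N-vector of predicted probabilities

def bce_loss(theta, y):
    # Binary cross-entropy, matching the loss equation above
    return -np.mean(y * np.log(theta) + (1 - y) * np.log(1 - theta))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))   # N = 5 instances, P = 3 features
W = rng.normal(size=(3, 4))   # K = 4 hidden units
beta = rng.normal(size=4)
y = np.array([0, 1, 1, 0, 1])

theta = forward(X, W, beta)
print(bce_loss(theta, y))     # a finite, positive loss value
```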

Neural Networks

Top layer: Logistic regression over the \(K\)-dimensional hidden space

What is the hidden space?

Think about this as a familiar problem.

\[\underset{N \times K}{\mathbf H} = \underset{N \times P}{\mathbf X} \underset{P \times K}{ \mathbf W}\]

  • Create a linear mapping of the original space to a different dimensional space

Neural Networks

By virtue of including the ReLU nonlinearity, this amounts to a kind of nonlinear PCA

The biggest difference is what we’re optimizing to:

  • PCA finds an arrangement that optimally reduces redundancies

  • Hidden spaces for FCNNs find an arrangement that optimally classifies observations!

Unlike PCA pre-processing, we make the different-dimensional hidden space a direct part of the problem!

Neural Networks

Our unknowns:

\[\bb \in \mathbb R^{K + 1}\]

\[\mathbf W \in \mathbb R^{(P + 1) \times K}\]

We know that we’re unlikely to be able to find analytical minima here, so we’ll need to rely on the gradient to use SGD methods to optimize this function!

  • Good rule of thumb: The shape of the gradient for each unknown should equal the shape of the unknown!
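The rule of thumb can be checked on a case where the gradient is known in closed form, linear regression with MSE loss (data invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 3
X = rng.normal(size=(N, P))
y = rng.normal(size=N)
beta = rng.normal(size=P)

# MSE loss gradient with respect to beta: -2/N X^T (y - X beta)
resid = y - X @ beta
grad = -2 / N * X.T @ resid

# The gradient has the same shape as the unknown it differentiates
print(grad.shape == beta.shape)  # True
```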

Neural Networks

In theory, we could write our own implementation of SGD that finds a minimum point in this loss space!

In practice, we don’t really need to do this.

  • For simple FCNNs, we can use sklearn.neural_network.MLPClassifier

  • For larger and more complex FCNNs, we’ll need to use Keras/TensorFlow/PyTorch

Neural Networks

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier

# X_train and y_train are the simulated training data from earlier

# Train a 1-layer neural network with 1 hidden unit
nn_clf = MLPClassifier(hidden_layer_sizes=(1,), activation='relu', solver='adam', random_state=42)
nn_clf.fit(X_train, y_train)

from sklearn.metrics import accuracy_score

# Compute the training accuracy
y_train_pred = nn_clf.predict(X_train)
training_accuracy = accuracy_score(y_train, y_train_pred)

# Create a grid of points
xx, yy = np.meshgrid(np.linspace(-3, 3, 500), np.linspace(-3, 3, 500))

# Use the trained model to predict the class
Z = nn_clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Create a contour plot
plt.figure()
contour = plt.contourf(xx, yy, Z, cmap='viridis', alpha=0.8)

# Set the title of the plot to the number of hidden units and the training accuracy
plt.title(f'Hidden Units: {nn_clf.hidden_layer_sizes[0]}, Training Accuracy: {training_accuracy:.2f}')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Neural Networks

Each hidden unit for a single layer neural network is a partition in the original feature space

  • Due to the ReLU activation, one side of the partition maps to 0 on the hidden unit

  • The other side maps to its linear predictor
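A sketch of one hidden unit's partition, with invented weights \(w\) and bias \(c\):

```python
import numpy as np

# One hidden unit with invented weights: w = [1, -1], c = 0.5
w, c = np.array([1.0, -1.0]), 0.5

# A small grid over the feature space
xx, yy = np.meshgrid(np.linspace(-2, 2, 5), np.linspace(-2, 2, 5))
grid = np.c_[xx.ravel(), yy.ravel()]

unit = np.maximum(0, grid @ w + c)

# One side of the line x1 - x2 + 0.5 = 0 maps to exactly 0;
# the other side keeps its linear predictor value
print((unit[grid @ w + c < 0] == 0).all())  # True
```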

Neural Networks

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier

# Train a 1-layer neural network with 2 hidden units
nn_clf = MLPClassifier(hidden_layer_sizes=(2,), activation='relu', solver='adam', random_state=42)
nn_clf.fit(X_train, y_train)

from sklearn.metrics import accuracy_score

# Compute the training accuracy
y_train_pred = nn_clf.predict(X_train)
training_accuracy = accuracy_score(y_train, y_train_pred)

# Create a grid of points
xx, yy = np.meshgrid(np.linspace(-3, 3, 500), np.linspace(-3, 3, 500))

# Use the trained model to predict the class
Z = nn_clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Create a contour plot
plt.figure()
contour = plt.contourf(xx, yy, Z, cmap='viridis', alpha=0.8)

# Set the title of the plot to the number of hidden units and the training accuracy
plt.title(f'Hidden Units: {nn_clf.hidden_layer_sizes[0]}, Training Accuracy: {training_accuracy:.2f}')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Neural Networks

If we let the number of hidden units go to \(\infty\), we’ll recover the nasty classifier!

A single layer NN with enough hidden units is a universal approximator

Why did I stop at 1000?

  • The NN compute time scales in the number of hidden units rather than \(P\) (more or less)

  • Any intuition?

Neural Networks

The question is how does a single layer neural network interpolate arbitrary functions?

  • With ReLU activation functions on each hidden unit

Neural Networks

A chain of three equations! \[\mathcal L(\boldsymbol \theta | \mathbf X , \mathbf y) = \expn (y_i - \mathbf z_i^T \bb)^2\]

\[\mathbf z_i = \varphi(\mathbf h_i)\]

\[\mathbf h_i = \mathbf x_i^T \mathbf W\]

  • Single layer NN with \(K\) hidden units and MSE loss
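To see what this looks like in practice, we can fit sklearn's MLPRegressor to a smooth curve (a toy setup of my own, not the data from the slides):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Noiseless sine data, invented for illustration
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(2 * x).ravel()

# Single hidden layer of ReLU units; lbfgs handles tiny problems well
nn_reg = MLPRegressor(hidden_layer_sizes=(50,), activation='relu',
                      solver='lbfgs', max_iter=5000, random_state=42)
nn_reg.fit(x, y)

print(nn_reg.score(x, y))  # training R^2; should be close to 1
```

The fitted prediction surface is piecewise linear, with one potential kink per hidden unit.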

Neural Networks

What we end up with is a set of local linear approximations

  • Each region defined by the ReLU functions over the hidden space (areas where the ReLU does not evaluate to zero) has the same slope

  • With more and more units, we get more and more linear approximations

  • At a certain granularity, what is the difference between lines and curves? Another victory for Taylor!

Neural Networks

We can uncover any arbitrary function using a single layer NN with \(K\) arbitrarily large!

This gets computationally unruly, though

  • We’re, more or less, computing \(K\) \(P\)-dimensional linear regressions

  • \(K = 10000\) is a lot of linear regressions!

Neural Networks

There’s a reasonable improvement - add more layers of hidden units.

\[\mathcal L(\boldsymbol \theta | \mathbf X , \mathbf y) = - \expn \left[ y_i \log \theta_i + (1 - y_i) \log ( 1 - \theta_i) \right]\]

\[\theta_i = \sigma(\mathbf z_i^T \boldsymbol \beta)\]

\[\mathbf z_{i} = \varphi(\mathbf h_{i})\]

\[\mathbf h_{i} = \mathbf q_i^T \mathbf W_2\]

\[\mathbf q_i = \varphi(\mathbf g_i)\]

\[\mathbf g_i = \mathbf x_i^T \mathbf W_1\]

Deep Neural Networks

A two layer, fully connected neural network (FCNN) can be succinctly defined in two equations:

\[\mathcal L(\boldsymbol \theta | \mathbf X , \mathbf y) = \expn \mathcal L(\theta_i | \mathbf x_i , y_i)\]

\[\theta_i = g(\bb^T \varphi(\mathbf W_2^T \varphi(\mathbf W_1^T \mathbf x_i + \mathbf c_1) + \mathbf c_2) + b)\]

  • \(K_1\) hidden units in the first layer

  • \(K_2\) hidden units in the second layer

  • Two sets of hidden weight matrices: \(\mathbf W_1\) (\(P \times K_1\)) and \(\mathbf W_2\) (\(K_1 \times K_2\))
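A shape check of the compact two layer forward pass, with random invented weights and \(\sigma\) as the output function \(g\):

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

P, K1, K2 = 4, 8, 6
rng = np.random.default_rng(0)

x = rng.normal(size=P)
W1, c1 = rng.normal(size=(P, K1)), rng.normal(size=K1)   # first hidden layer
W2, c2 = rng.normal(size=(K1, K2)), rng.normal(size=K2)  # second hidden layer
beta, b = rng.normal(size=K2), 0.1                       # output layer

# theta = g(beta^T phi(W2^T phi(W1^T x + c1) + c2) + b)
q = np.maximum(0, W1.T @ x + c1)   # K1-vector
z = np.maximum(0, W2.T @ q + c2)   # K2-vector
theta = sigmoid(beta @ z + b)      # scalar probability in (0, 1)

print(q.shape, z.shape, float(theta))
```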

Deep Neural Networks

Why depth rather than width?

Deep Neural Networks

Let’s show some benefits via simulation:

  • Same training data

  • Two layer neural network with \(K_1 = K_2\) hidden units

  • One layer neural network with \(2 \times K_1\) hidden units

The total number of vectors of coefficients is the same!

  • We’ve just arranged them differently.

Deep Neural Networks

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import time

# List of numbers of hidden units
hidden_units = [2, 5, 10] + list(range(25, 101, 25))

# List to store the accuracies
accuracies = []
ttime = []

# For each number of hidden units
for h in hidden_units:
    start_time = time.time()
    # Create a MLPClassifier object
    clf = MLPClassifier(hidden_layer_sizes=(h, h), activation='relu', solver='adam', max_iter=1000, random_state=42)

    # Train the MLPClassifier
    clf.fit(X_train, y_train)

    # Predict the classes of the training set
    y_pred = clf.predict(X_train)

    # Compute the accuracy of the classifier
    accuracy = accuracy_score(y_train, y_pred)

    # Record the accuracy
    accuracies.append(accuracy)
    ttime.append(time.time() - start_time)

# List to store the accuracies for the single layer neural network
accuracies_single_layer = []
ttime_single_layer = []

# For each number of hidden units
for h in hidden_units:
    start_time = time.time()
    # Multiply the number of hidden units by two
    g = h*2

    # Create a MLPClassifier object with a single layer
    clf = MLPClassifier(hidden_layer_sizes=(g,), activation='relu', solver='adam', max_iter=1000, random_state=42)

    # Train the MLPClassifier
    clf.fit(X_train, y_train)

    # Predict the classes of the training set
    y_pred = clf.predict(X_train)

    # Compute the accuracy of the classifier
    accuracy = accuracy_score(y_train, y_pred)

    # Record the accuracy
    ttime_single_layer.append(time.time() - start_time)
    accuracies_single_layer.append(accuracy)

Deep Neural Networks

DNNs with an equivalent number of weight vectors to single layer NNs:

  • Achieve higher accuracy with fewer hidden units

  • Train faster than the equivalent single layer model

For compute time, any intuition as to what’s going on here?

  • We’re optimizing with ADAM for both

Deep Neural Networks

How are we able to achieve higher accuracy with a DNN?

It really comes down to how it constructs the hidden units!

Deep Neural Networks

Unique patterns found: 2760

Deep Neural Networks

The top layer receives information and attempts to put the regions into an order that will directly minimize classification error

  • A smooth set of regions that attempt to capture probability patterns in the prediction surface

A 2 layer model breaks the problem into two steps

  • Find regions in the feature space that are similar regardless of spatial similarity

  • Classify in the reordered space that has already attempted to put like regions together in a smooth way!

Deep Neural Networks

DNNs work so well because they treat the feature extraction problem in a hierarchical way!

  • For classification, this is how a human would approach the problem

  • First, try to put like observations with like observations regardless of class

  • Then, apply classification to groups of instances instead of one at a time

  • Hierarchical pattern finding

Deep Neural Networks

Next class

  • More on DNNs

  • Lack of strict convexity and optimization difficulty

  • Backpropagation